Abstract

Stochastic context-free grammars (SCFGs) are often used to
represent the syntax of natural languages. Most algorithms for
learning them require storage and repeated processing of a
sentence corpus. The memory and computational demands of such
algorithms are ill-suited for embedded agents such as a mobile
robot. Two algorithms are presented that incrementally learn the
parameters of stochastic context-free grammars as sentences are
observed. Both algorithms require a fixed amount of space
regardless of the number of sentence observations. Despite using
less information than the inside-outside algorithm, the
algorithms perform almost as well.


Introduction 

Although natural languages are not entirely context free,
stochastic context-free grammars (SCFGs) are an effective
representation for capturing much of their structure. However,
for embedded agents, most algorithms for learning SCFGs from data
have two shortcomings. First, they need access to a corpus of
complete sentences, requiring the agent to retain every sentence
it hears. Second, they are batch algorithms that make repeated
passes over the data, often requiring significant computation in
each pass. These shortcomings are addressed through two online
algorithms called Span¹ and Prespan² that learn the parameters of
SCFGs using only summary statistics in combination with repeated
sampling techniques.

SCFGs contain both structure (i.e., rules) and parameters (i.e.,
rule probabilities). One approach to learning SCFGs from data is
to start with a grammar containing all possible rules that can be
created from some alphabet of terminals and non-terminals.
Typically the size of the right-hand side of each rule is bounded
by a small constant (e.g., 2). Then an algorithm for learning
parameters is applied and allowed to “prune” rules by setting
their expansion probabilities to zero (Lari & Young, 1990).
Prespan and Span operate in this paradigm by assuming a fixed
structure and modifying the parameters.

¹ Span stands for Sample Parse Adjust Normalize.

² Prespan is so named because it is the predecessor of Span, so
it literally means pre-Span.

Given a SCFG to be learned, both algorithms have access to the
structure of the grammar and a set of sentences generated by the
grammar. The correct parameters are unknown. Prespan and Span
begin by parsing the sentence corpus using a chart parser. Note
that the parse of an individual sentence does not depend on the
parameters; it only depends on the structure. However, the
distribution of sentences parsed does depend on the parameters of
the grammar used to generate them. Both algorithms associate with
each rule a histogram that records the number of times the rule
is used in parses of the individual sentences.

Prespan and Span make an initial guess at the values of the
parameters by setting them randomly. They then generate a corpus
of sentences with these parameters and parse them, resulting in a
second set of histograms. The degree to which the two sets of
histograms differ is a measure of the difference between the
current parameter estimates and the target parameters. Prespan
modifies its parameter estimates so the sum total difference
between the histograms is minimized. In contrast, Span modifies
its estimates so the difference between individual histograms is
minimized. Empirical results show that this procedure yields
parameters that are close to those found by the inside-outside
algorithm.

Stochastic  Context-Free  Grammars 

Stochastic context-free grammars³ are the natural extension of
context-free grammars to the probabilistic domain (Sipser, 1997;
Charniak, 1993). Said differently, they are context-free grammars
with probabilities associated with each rule. Formally, a SCFG is
a four-tuple M = (V, Σ, R, S) where

1. V is a finite set of non-terminals

2. Σ is a finite set, disjoint from V, of terminals

3. R is a finite set of rules of the form A → w where A belongs
to V, and w is a finite string composed of elements from V and Σ.
We refer to A as the left-hand side (LHS) of the rule and w as
the right-hand side (RHS), or expansion, of the rule.
Additionally, each rule r has an associated probability p(r) such
that the probabilities of rules with the same left-hand side sum
to 1.

4. S is the start symbol.

³ Stochastic Context-Free Grammars are often called Probabilistic
Context-Free Grammars.

Grammars  can  either  be  ambiguous  or  unambiguous. 
Ambiguous  grammars  can  generate  the  same  string  in  mul¬ 
tiple  ways.  Unambiguous  grammars  cannot. 
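
The formal definition above can be made concrete with a small sketch. The grammar below is a hypothetical toy example, not one of the paper's tables, and the names `RULES`, `is_normalized`, and `generate` are illustrative assumptions. The sketch checks the constraint that rules sharing a left-hand side have probabilities summing to 1, and samples a sentence by stochastic expansion.

```python
import random
from collections import defaultdict

# Hypothetical toy grammar (not taken from the paper's tables).
# Each rule is (left-hand side, right-hand side tuple, probability).
RULES = [
    ("S", ("A", "A"), 0.6),
    ("S", ("A", "S"), 0.4),
    ("A", ("y",), 1.0),
]

def is_normalized(rules, tol=1e-9):
    """Check that probabilities of rules sharing a left-hand side sum to 1."""
    totals = defaultdict(float)
    for lhs, _, p in rules:
        totals[lhs] += p
    return all(abs(total - 1.0) <= tol for total in totals.values())

def generate(rules, symbol="S", rng=random):
    """Sample one sentence (a list of terminals) by expanding `symbol`."""
    options = [(rhs, p) for lhs, rhs, p in rules if lhs == symbol]
    if not options:
        # No rule expands this symbol, so it is treated as a terminal.
        return [symbol]
    rhs = rng.choices([o[0] for o in options],
                      weights=[o[1] for o in options])[0]
    return [tok for part in rhs for tok in generate(rules, part, rng)]
```

Calling `generate(RULES)` returns a list of `"y"` tokens whose length varies from call to call, which is the sense in which the distribution of sentences depends on the rule probabilities.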

Learning Stochastic Context-Free Grammars

Learning context-free grammars is the problem of inducing a
context-free structure (or model) from a corpus of sentences
(i.e., data). When the grammars are stochastic, one faces the
additional problem of learning rule probabilities (parameters)
from the corpus. Given a set of sentence observations
O = {o_0 ... o_{n-1}}, the goal is to discover the grammar that
generated O. Typically, this problem is framed in terms of a
search in grammar space where the objective function is the
likelihood of the data given the grammar. While the problem of
incrementally learning the structure of SCFGs is interesting in
its own right, the main focus here is on learning parameters. For
a thorough overview of existing techniques for learning
structure, see (Stolcke, 1994; Chen, 1996; Nevill-Manning &
Witten, 1997).

Learning  Parameters 

The inside-outside algorithm (Lari & Young, 1990; Lari & Young,
1991) is the standard method for estimating parameters in SCFGs.
The algorithm uses the general-purpose expectation-maximization
(EM) procedure. Almost all parameter learning is done batch style
using some version of the inside-outside algorithm. For example,
in learning parameters, Chen (1995) initially estimates the rule
probabilities using the most probable parse of the sentences
given the grammar (the Viterbi parse) and then uses a “post-pass”
procedure that incorporates the inside-outside algorithm.

To use EM the entire sentence corpus must be stored. While this
storage may not be in the form of actual sentences, it is always
in some representation that easily allows the reconstruction of
the original corpus (e.g., the chart of a chart parse). Because
we are interested in language acquisition in embedded agents over
long periods of time, the prospect of memorizing and repeatedly
processing entire sentence corpora is unpalatable.

This  motivation  also  carries  a  desire  to  easily  adjust  our 
parameters  when  new  sentences  are  encountered.  That  is,  we 
want  to  learn  production  probabilities  incrementally.  While 
the  inside-outside  algorithm  can  incorporate  new  sentences 
in  between  iterations,  it  still  uses  the  entire  sentence  corpus 
for  estimation. 

The  Algorithms 

Prespan and Span address some of the concerns given in the
previous section. Both are unsupervised, incremental algorithms
for finding parameter estimates in stochastic context-free
grammars.

Prespan and Span are incremental in two ways. First, in the
classical sense, at every iteration they use the previously
learned parameter estimate as a stepping stone to the new one.
Second, both algorithms naturally allow new data to contribute to
learning without restarting the entire process.

Table 1: A grammar that generates palindromes

Figure 1: The palindrome grammar rule histograms after only one
parse of the sentence y y y y. Because only one sentence has been
parsed, the mass of the distribution is concentrated in a single
bin. Panels include C → S A, D → S B, B → Z and A → Y.

Prespan and Span use only a statistical summary of the
observation data for learning. Both store the summary information
in histograms. Span and Prespan also record histogram information
about their current parameter estimates, so the addition of new
sentences typically does not increase the memory requirements.
Furthermore, the histograms play a crucial role in learning. If
the parameter estimates are accurate, the histograms of the
observed data should resemble the histograms of the current
parameterization. When the histograms do not resemble each other,
the difference is used to guide the learning process.

A Description of Prespan

Let T = (V, Σ, R, S) be a SCFG and let O = {o_0 ... o_{n-1}} be a
set of n sentences generated stochastically from T. Let
M = (V, Σ, R', S) be a SCFG that is the same as T except the rule
probabilities in R' have been assigned at random (subject to the
constraint that the sum of the probabilities for all rules with
the same left-hand side is one). T is called the target grammar
and M the learning grammar. The goal is to use a statistical
summary of O to obtain parameters for M that are as close to the
unknown parameters of T as possible.

Using M and a standard chart parsing algorithm (e.g., Charniak,
1993 or Allen, 1995) one can parse a sentence and count how many
times a particular rule was used in deriving that sentence. Let
each rule r in a grammar have two associated histograms. The
first is constructed by parsing each sentence in O and recording
the number of times rule r appears in the parse tree of the
sentence. Histograms constructed in this way are called
observation histograms. The indices of the histogram range from 0
to k where k is the maximum number of times a rule was used in a
particular sentence parse. In many cases, k remains small, and
more importantly, when a sentence parse does not increase k, the
storage requirements remain unchanged.

Figure 2: The palindrome grammar rule histograms after parsing
y y y y and y y. Notice that the mass of rules S → A C and
C → S A is now evenly distributed between 0 and 1. Similarly, the
mass of rule A → Y is evenly distributed between 2 and 4.

The second histogram of each rule is identical in nature to the
observation histogram but is filled during the learning process,
so it is called a learning histogram. Like the observation
histograms, each learning histogram records the number of times
its rule occurs in single sentence parses. The difference is that
the corpus of sentences parsed to fill the learning histograms is
generated stochastically from M using its current parameters.

For example, suppose Prespan is provided with the
palindrome-generating structure given in Table 1 and encounters
the sentence y y y y. Chart parsing the sentence reveals that
rule A → Y has frequency 4, rules S → A A, S → A C and C → S A
have frequency 1, and the remaining rules have frequency 0.
Figure 1 depicts graphically the histograms for each rule after
parsing the sentence. In parsing the sentence y y, the rule
S → A A is used once, A → Y is used twice and the other rules are
not used. Figure 2 shows how the histograms in Figure 1 change
after additionally parsing y y.
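
The bookkeeping in this example can be sketched as follows. The representation is an assumption made for illustration: each rule's histogram is a `Counter` mapping "times used in one parse" to "number of sentences", and the per-sentence rule counts are taken as given rather than produced by a real chart parser. Only the rules named in the example are listed; the full rule set of Table 1 is not assumed.

```python
from collections import Counter, defaultdict

def update_histograms(histograms, rule_counts, all_rules):
    """Add one parsed sentence to the per-rule histograms.

    histograms[rule][k] is the number of sentences whose parse used
    `rule` exactly k times. The sentence itself is never stored.
    """
    for rule in all_rules:
        histograms[rule][rule_counts.get(rule, 0)] += 1

# Rules mentioned in the running example (written with the source's labels).
rules = ["S -> A A", "S -> A C", "C -> S A", "A -> Y"]
hist = defaultdict(Counter)

# Rule usage counts for the parse of "y y y y".
update_histograms(
    hist,
    {"A -> Y": 4, "S -> A A": 1, "S -> A C": 1, "C -> S A": 1},
    rules,
)
# Rule usage counts for the parse of "y y".
update_histograms(hist, {"A -> Y": 2, "S -> A A": 1}, rules)
```

After both updates, the histogram for `S -> A C` has one sentence in bin 0 and one in bin 1, and the histogram for `A -> Y` has one sentence each in bins 2 and 4, which matches the description of Figure 2.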

After every sentence parse, Prespan updates the observation
histograms and discards the sentence along with its parse. It is
left with only a statistical summary of the corpus. As a result,
one cannot reconstruct the observation corpus or any single
sentence within it. From this point forward the observation
histograms are updated only when new data is encountered.

Prespan now begins the iterative process. First, it randomly
generates a small sentence corpus of prespecified constant size s
from its learning grammar M. Each sentence in the sample is
parsed using a chart parser. Using the chart, Prespan records
summary statistics exactly as it did for the observation corpus
except the statistics for each rule r are added to the learning
histograms instead of the observation histograms. After
discarding the sentences, the learning histograms are normalized
to some fixed size h. Without normalization, the information
provided by the new statistics would have decreasing impact on
the histograms’ distributions. This is because the bin counts
typically increase linearly while the sample size remains
constant. Future work will examine the role of the normalization
factor; for this work, however, it is kept fixed throughout the
duration of the algorithm.
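
The text does not spell out how a learning histogram is normalized to size h. One plausible reading, sketched here purely as an assumption, is to rescale the bin counts proportionally so that they sum to h. The function name `normalize_histogram` is hypothetical.

```python
def normalize_histogram(bins, h):
    """Rescale bin counts so they sum to h, preserving their proportions.

    This proportional rescaling is an assumed interpretation of
    "normalized to some fixed size h", not a detail given in the text.
    """
    total = sum(bins.values())
    if total == 0:
        return dict(bins)
    return {k: count * h / total for k, count in bins.items()}

# A histogram that has grown to 200 entries is rescaled back to h = 100.
rescaled = normalize_histogram({0: 30, 1: 150, 2: 20}, 100)
```

Under this reading, each new sample of s sentences always contributes the same share of mass relative to the h units already in the histogram, so recent samples do not lose influence as iterations accumulate.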

For each rule r Prespan now has two distributions: one from the
observation histogram, based on the corpus generated from T, and
one from the learning histogram, based on the corpus generated
from M. Comparing the two seems a natural predictor of the
likelihood of the observation corpus given Prespan’s learning
grammar. Relative entropy (also known as the Kullback-Leibler
distance) is commonly used to compare two distributions p and q
(Cover & Thomas, 1991). It is defined as:

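$$D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

Because two distributions are associated with each rule r, the
relative entropies are summed over the rules. Writing p_r for the
distribution from the observation histogram of r and q_r for the
distribution from its learning histogram, the sum is

$$\tau = \sum_{r} \sum_{x} p_r(x) \log \frac{p_r(x)}{q_r(x)} \qquad (1)$$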
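
The relative entropy comparison can be sketched in a few lines. Two details here are assumptions rather than part of the text: histograms are dictionaries from bin index to count, and a small constant `eps` is added to every bin so that an empty learning bin does not produce a division by zero.

```python
import math

def to_distribution(bins, support, eps=1e-6):
    """Turn bin counts into a smoothed distribution over `support`.

    The additive smoothing constant eps is an assumption of this sketch.
    """
    raw = [bins.get(k, 0) + eps for k in support]
    total = sum(raw)
    return [v / total for v in raw]

def relative_entropy(p, q):
    """D(p || q) = sum over x of p(x) * log(p(x) / q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def summed_relative_entropy(obs_hists, learn_hists):
    """Sum D(p_r || q_r) over every rule r with an observation histogram."""
    total = 0.0
    for rule, obs in obs_hists.items():
        learn = learn_hists.get(rule, {})
        support = sorted(set(obs) | set(learn))
        total += relative_entropy(
            to_distribution(obs, support),
            to_distribution(learn, support),
        )
    return total
```

When the observation and learning histograms of every rule have the same shape, the sum is zero; it grows as the learning grammar's samples drift away from the observed usage counts.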

If the summed relative entropy τ of Equation 1 decreases between
iterations, then the likelihood of M is increasing, so Prespan
increases the probabilities of the rules used in generating the
sample corpus. When s is large, the algorithm only increases a
small subset of the rules used to generate the sample.⁴ Likewise,
if τ increases between iterations, Prespan decreases the rule
probabilities.

Prespan uses a multiplicative update function. Suppose rule r was
selected for an update at time t. If p_t(r) is the probability of
r at time t and the summed relative entropy decreased between
iterations, then p_{t+1}(r) = 1.01 * p_t(r). Once the probability
updates are performed, Prespan starts another iteration beginning
with the generation of a small sentence corpus from the learning
grammar. The algorithm stops iterating when the relative entropy
falls below a threshold, or some prespecified number of
iterations has completed.
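
A minimal sketch of the multiplicative update follows. The factor 1.01 for an increase comes from the text; using 1/1.01 for a decrease and renormalizing the probabilities of rules that share a left-hand side afterwards are assumptions of this sketch, as are the name `prespan_update` and the dictionary representation.

```python
def prespan_update(probs, selected, improved, factor=1.01):
    """Scale the selected rules' probabilities, then renormalize per LHS.

    probs maps (lhs, rhs) to a probability. `improved` is True when the
    summed relative entropy decreased since the previous iteration.
    The 1/factor decrease and the renormalization are assumed details.
    """
    scale = factor if improved else 1.0 / factor
    scaled = {
        rule: (p * scale if rule in selected else p)
        for rule, p in probs.items()
    }
    totals = {}
    for (lhs, _), p in scaled.items():
        totals[lhs] = totals.get(lhs, 0.0) + p
    return {rule: p / totals[rule[0]] for rule, p in scaled.items()}

probs = {("S", ("A", "A")): 0.5, ("S", ("A", "C")): 0.5}
updated = prespan_update(probs, {("S", ("A", "A"))}, improved=True)
```

Renormalizing keeps the grammar a valid SCFG after each step, at the cost of slightly lowering the probabilities of the unselected rules with the same left-hand side.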

A  Description  of  Span 

Span differs from Prespan in the selection of rules to update,
the criteria for updates, and the update rule itself. Recall that
Prespan uses the sum of the relative entropies of all rules (see
Equation 1) as a measure of progress or deterioration of grammar
updates. Since this sum is an aggregate value, an unsuccessful
change in probability for one rule could overshadow a successful
change of another rule. Furthermore, the update rule does not
differentiate between small and large successes and failures.

Span addresses these concerns by examining local changes in
relative entropy and using those values to make rule-specific
changes. Span calculates the relative entropy for rule r at time
t and compares it with the relative entropy at time t − 1. If the
relative entropy decreases, Span updated the rule probability
favorably; if it increases, the distributions have become more
dissimilar, so the probability should move in the opposite
direction. This is best explained by examining Span’s update
rule:

$$P_{t+1}(r) = P_t(r) + \alpha \cdot \mathrm{sgn}(P_t(r) - P_{t-1}(r)) \cdot \mathrm{sgn}(\Delta RE_t) \cdot f(\Delta RE_t) + \beta \cdot (P_t(r) - P_{t-1}(r)) \qquad (2)$$

Table 2: A grammar generating simple English phrases

⁴ Using only the rules fired during the generation of the last
sentence seems to work well.

The update rule is based on the steepest descent method
(Bertsekas & Tsitsiklis, 1996). Here, sgn is the “sign” function
that returns −1 if its argument is negative, 0 if its argument is
zero and +1 if its argument is positive. The first sign function
determines the direction of the previous update. That is, it
determines whether, in the last time step, Span increased or
decreased the probability. The second sign function determines if
the relative entropy has increased or decreased. If it has
decreased, then the difference ΔRE_t is positive; if it
increased, the difference is negative. Together these sign
functions determine the direction of the step. The function
f(ΔRE_t) returns the magnitude of the step. This, intuitively, is
an estimate of the gradient since the magnitude of the change in
relative entropy is reflective of the slope. The α parameter is a
step-size. Finally, the β * (P_t(r) − P_{t−1}(r)) expression is a
momentum term.
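
The update rule can be written out as a short function. Several choices below are assumptions made only for illustration: the values of `alpha` and `beta`, the use of the absolute value for f, and the name `span_step`. The text does not specify any of them. Following the description above, the change in relative entropy is taken to be positive when the relative entropy has decreased.

```python
def sgn(x):
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)

def span_step(p_t, p_prev, re_t, re_prev, alpha=0.01, beta=0.5, f=abs):
    """One application of the update rule to a single rule's probability.

    p_t, p_prev:   the rule's probability at times t and t-1
    re_t, re_prev: the rule's relative entropy at times t and t-1
    alpha, beta, f are illustrative assumptions, not values from the text.
    """
    delta_re = re_prev - re_t  # positive when relative entropy decreased
    direction = sgn(p_t - p_prev) * sgn(delta_re)
    return p_t + alpha * direction * f(delta_re) + beta * (p_t - p_prev)

# The last step raised the probability and the relative entropy fell,
# so the gradient-like term keeps pushing in the same direction.
p_next = span_step(p_t=0.5, p_prev=0.4, re_t=1.0, re_prev=1.5)
```

In a full implementation the updated values would presumably need to be kept inside (0, 1) and renormalized across rules that share a left-hand side, in keeping with the "Normalize" step in Span's name; that step is omitted here.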

Once the probability updates are performed for each rule, another
iteration starts beginning with the generation of a small
sentence corpus from the learning grammar. Like Prespan, the
algorithm stops iterating when the relative entropy falls below a
threshold, or some prespecified number of iterations has
completed.

Span is a more focused learning algorithm than Prespan. This is
because all rules are individually updated based on local changes
instead of stochastically selected and updated based on global
changes. The algorithmic changes speed up learning by, at times,
two orders of magnitude. While these changes drastically reduce
learning time, they do not necessarily result in more accurate
grammars. Evidence and explanation of this is given in the
Experiments section.


Algorithm  Analysis 

In this section both the time and space requirements of the
algorithms are analyzed. Comparing the results with the time and
space requirements of the inside-outside algorithm shows that
Span and Prespan are asymptotically equivalent in time but nearly
constant (as opposed to linear with inside-outside) in space.

The inside-outside algorithm runs in O(C³|V|³) time where C is
the length of the sentence corpus and |V| is the number of
non-terminal symbols (Stolcke, 1994). The complexity arises
directly from the chart parsing routines used to estimate
probabilities. Note that the number of iterations used by the
inside-outside algorithm is dominated by the computational
complexity of chart parsing.

Both Span and Prespan chart parse the observation corpus once but
repeatedly chart parse the fixed-size samples they generate
during the learning process. Taken as a whole, this iterative
process typically dominates the single parse of the observation
sentences, so the computational complexity is O(J³|V|³) where J
is the length of the maximum sample of any iteration.

Every iteration of the inside-outside algorithm requires the
complete sentence corpus. Using the algorithm in the context of
embedded agents, where the sentence corpus increases continuously
with time, means a corresponding continuous increase in memory.
With Span and Prespan, the memory requirements remain effectively
constant.

While the algorithms continually update their learning histograms
through the learning process, the number of bins increases only
when a sentence parse contains an occurrence count larger than
any encountered previously. The sample is representative of the
grammar parameters and structure, so typically after a few
iterations, the number of bins becomes stable. This means that
when new sentences are encountered there is typically no increase
in the amount of space required.


Experiments 

The previous section described two online algorithms for learning
the parameters of SCFGs given summary statistics computed from a
corpus of sentences. The remaining question is whether the
quality of the learned grammar is sacrificed because a
statistical summary of the information is used rather than the
complete sentence corpus. This section presents the results of
experiments that compare the grammars learned with Prespan and
Span with those learned by the inside-outside algorithm.

The following sections provide experimental results for both the
Prespan and Span algorithms.

Experiments with Prespan

As in the description of Prespan, let T be the target grammar
whose parameters are to be learned. Let M be a grammar that has
the same structure as T but with rule probabilities initialized
uniformly at random and normalized so the sum of the
probabilities of the rules with the same left-hand side is 1.0.
Let O be a set of sentences generated stochastically from T. The
performance of the algorithm is evaluated by running it on M and
O and computing the log likelihood of O given the final grammar.

Table 3: An ambiguous grammar

Because the algorithm learns parameters for a fixed structure, a
number of different target grammars are used in experimentation,
each with the same structure but different rule probabilities.
The goal is to determine whether any regions of parameter space
were significantly better for one algorithm over the other. This
is accomplished by stochastically sampling from this space. Note
that a new corpus is generated for each new set of parameters as
they influence which sentences are generated.

The grammar shown in Table 2 (Stolcke, 1994) was used in this
manner with 50 different target parameter settings and 500
sentences in the observation corpus for each setting. The mean
and standard deviation of the log likelihoods for Prespan with
h = s = 100 (histogram size and learning corpus size
respectively) were μ = −962.58 and σ = 241.25. These values for
the inside-outside algorithm were μ = −959.83 and σ = 240.85.
Recall that equivalent performance would be a significant
accomplishment because the online algorithm has access to much
less information about the data. Suppose the means of both
empirical distributions are equal. With this assumption as the
null hypothesis, a two-tailed t-test results in p = 0.95. This
means that if one rejects the null hypothesis, the probability of
making an error is 0.95.
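
The reported p value can be roughly cross-checked from the summary statistics alone. The sketch below assumes an unpooled two-sample t statistic with 50 runs per algorithm and, to stay within the standard library, approximates the t distribution by a normal distribution; both are simplifications of whatever procedure was actually used.

```python
import math

def two_sample_t(mean_a, sd_a, mean_b, sd_b, n):
    """t statistic for two equal-sized samples with unpooled variances."""
    standard_error = math.sqrt(sd_a ** 2 / n + sd_b ** 2 / n)
    return (mean_a - mean_b) / standard_error

def two_tailed_p_normal(t):
    """Two-tailed p value using a normal approximation to the t distribution."""
    return math.erfc(abs(t) / math.sqrt(2))

# Inside-outside versus Prespan, using the means and deviations quoted above.
t = two_sample_t(-959.83, 240.85, -962.58, 241.25, 50)
p = two_tailed_p_normal(t)
```

With these inputs the statistic is about 0.057 and the approximate p value is about 0.95, in line with the figure quoted in the text. The difference in means (2.75) is tiny relative to the spread across target grammars, which is why the unpaired test cannot separate the two algorithms.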

Unfortunately, the above result does not sanction the conclusion
that the two distributions are the same. One can, however, look
at the power of the test in this case. If the test’s power is
high then it is likely that a true difference in the means would
be detected. If the power is low then it is unlikely that the
test would detect a real difference. The power of a test depends
on a number of factors, including the sample size, the standard
deviation, the significance level of the test, and the actual
difference between the means. Given a sample size of 50, a
standard deviation of 240.05, a significance level of 0.05, and
an actual delta of 174.79, the power of the t-test is 0.95. That
is, with probability 0.95 the t-test will detect a difference in
means of at least 174.79 at the given significance level. Because
the difference between the means of the two distributions is
minute, a more powerful test is needed.

Since both Prespan and the inside-outside algorithm were run on
the same problems, a paired sample t-test can be applied. This
test is more powerful than the standard t-test. Suppose again
that the means of the two distributions are equal. Using this as
the null hypothesis and performing the paired sample t-test
yields p < 0.01. That is, the probability of making an error in
rejecting the null hypothesis is less than 0.01. Closer
inspection of the data reveals why this is the case:
inside-outside performed better than the online algorithm on each
of the 50 grammars. However, as is evident from the means and
standard deviations, the absolute difference in each case was
quite small.

The same experiments were conducted with the ambiguous grammar
shown in Table 3. The grammar is ambiguous because, for example,
z z can be generated by S → A → G G → z z or S → B A with
B → S → C → z and A → G → z. The mean and standard deviation of
the log likelihoods for Prespan were μ = −1983.15 and σ = 250.95.
These values for the inside-outside algorithm were μ = −1979.37
and σ = 250.57. The standard t-test returned a p value of 0.94
and the paired sample t-test was significant at the 0.01 level.
Again, inside-outside performed better on every one of the 50
grammars, but the differences were very small.

Experiments  with  Span 

The same experiments were performed using Span. That is, the
grammar in Table 2 was used with 50 different target parameter
settings and 500 sentences in the observation corpus for each
setting. The mean and standard deviation of the log likelihoods
for Span with h = s = 100 (histogram size and learning corpus
size respectively) were μ = −4266.44 and σ = 650.57. These values
for the inside-outside algorithm were μ = −3987.58 and
σ = 608.59. Recall that equivalent performance would be a
significant accomplishment because the online algorithm has
access to much less information about the data. Taking equality
of the two means as the null hypothesis, a two-tailed t-test
results in p = 0.03.

The same experiment was conducted with the ambiguous grammar
shown in Table 3. The grammar is ambiguous because, for example,
z z could be generated by S → A → G G → z z or S → B A with
B → S → C → z and A → G → z. The mean and standard deviation of
the log likelihoods for the online algorithm were μ = −2025.93
and σ = 589.78. These values for the inside-outside algorithm
were μ = −1838.41 and σ = 523.46. The t-test returned a p value
of 0.33.

Inside-outside performed significantly better on the unambiguous
grammar but there was not a significant difference on the
ambiguous grammar. Given the fact that Span has access to far
less information than the inside-outside algorithm, this is not a
trivial accomplishment. One conjecture is that Span never
actually converges to a stable set of parameters but walks around
whatever local optimum it finds in parameter space. This is
suggested by the observation that for any given training set the
log likelihood for the inside-outside algorithm is always higher
than that for Span. Comparison of the parameters learned shows
that Span is moving in the direction of the correct parameters
but that it never actually converges on them.

Discussion 

It was noted earlier that Span learns more quickly than Prespan,
but the Experiments section shows this improvement may come at a
cost. One reason for this may lie in the sentence samples
produced from the learning grammar during each iteration. Recall
that Span learns by generating a sentence sample using its
current parameter estimates. Then this sample is parsed and the
distribution is compared to the distribution of the sentences
generated from the target grammar. Each sentence sample reflects
the current parameter estimates, but also has some amount of
error. This error may be more pronounced in Span because at each
iteration, every rule is updated. This update is a direct
function of statistics computed from the sample, so the sample
error may overshadow actual improvement or deterioration in
parameter updates from the last iteration.

Overcoming  the  sample  error  problem  in  Span  might  be 
accomplished  by  incorporating  global  views  of  progress,  not 
unlike  those  used  in  Prespan.  In  fact,  a  synergy  of  the  two 
algorithms  may  be  an  appropriate  next  step  in  this  research. 

Another interesting prospect for future parameter-learning
research is based on rule orderings. Remember that parameter
changes in rules closer to the start symbol of a grammar have
more effect on the overall distribution of sentences than changes
to parameters farther away. One idea is to take the grammar and
transform it into a graph so that each unique left-hand side
symbol is a vertex and each individual right-hand-side symbol is
a weighted arc. Using the start-symbol vertex as the root node
and assuming each arc has weight 1.0, one can assign a rank to
each vertex by finding the weight of the shortest path from the
root to all the other vertices. This ordering may provide a
convenient way to iteratively learn the rule probabilities. One
can imagine concentrating only on learning the parameters of the
rules with rank 1, then fixing those parameters and working on
rules with rank 2, and so forth. When the final rank is reached,
the process would start again from the beginning. Clearly
self-referential rules may pose some difficulty, but the ideas
have yet to be fully examined.
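
The ranking idea can be sketched with a breadth-first search, which yields shortest-path distances when every arc has weight 1. The grammar `toy` is hypothetical, the function name `rank_nonterminals` is an assumption, and the sketch assigns the start symbol a rank of 0; how ranks would map onto the rank 1, rank 2 schedule described above is left open.

```python
from collections import deque

def rank_nonterminals(rules, start="S"):
    """Shortest-path distance of each non-terminal from the start symbol.

    rules is an iterable of (lhs, rhs tuple) pairs. An arc of weight 1
    runs from each left-hand side to every non-terminal on its
    right-hand sides, so breadth-first search gives the shortest paths.
    """
    nonterminals = {lhs for lhs, _ in rules}
    rank = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for lhs, rhs in rules:
            if lhs != current:
                continue
            for symbol in rhs:
                if symbol in nonterminals and symbol not in rank:
                    rank[symbol] = rank[current] + 1
                    queue.append(symbol)
    return rank

# Hypothetical grammar for illustration only; C refers back to S.
toy = [
    ("S", ("A", "C")),
    ("C", ("S", "D")),
    ("A", ("y",)),
    ("D", ("z",)),
]
ranks = rank_nonterminals(toy)
```

Because a symbol is ranked only the first time it is reached, the self-referential arc from C back to S does not cause a loop here, although it may still complicate the staged learning schedule itself.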

Conclusion 

Most parameter learning algorithms for stochastic context-free
grammars retain the entire sentence corpus throughout the
learning process. Incorporating a complete memory of sentence
corpora seems ill-suited for learning in embedded agents. Prespan
and Span are two incremental algorithms for learning parameters
in stochastic context-free grammars using only summary statistics
of the observed data. Both algorithms require a fixed amount of
space regardless of the number of sentences they process. Despite
using much less information than the inside-outside algorithm,
Prespan and Span perform almost as well.